-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed install/remove of telegraf on non-systemd Debian/Ubuntu systems #2360
Conversation
@sparrc don't know what exactly fails in circle but an earlier build succeeded in my fork (only changed the commit message in between). can we do a quick release of telegraf 1.2.2 with this fix because we encountered this fixed issue with systemd while trying to deploy telegraf on some newer machines and we want to continue - but we need a new release with that fix!? that would be awesome! haven't updated the CHANGELOG in hope of an earlier release ;) |
sorry but I can't do one-off releases on-demand, you'll have to build it yourself. I think this change seems OK but you'll need to create corresponding PRs in the influxdb and kapacitor projects before I can merge it. |
okay no problem. will do the corresponding PR's in the other two projects. Rebased right now and made some more enhancements to the pre-install and post-remove scripts w. code copied from influxdb which is better For more clarification on the Debian Jessie w. SysVinit:
|
@sparrc @rossmcdonald i've made some more enhancements after comparing the scripts from telegraf, influxdb and kapacitor and enhanced all scripts of all three projects accordingly. i've referenced this PR in the other two PR too, so more information is available about the what and why. I hope it's fine for you. As soon as you tell me that i can add the PR as a link to the changelog (for 1.3.0?), ill do another rebase on all three projects accordingly. If there are any more questions, just ask! Btw: i've tested those changes manually locally as well on a Debian Jessie system with sysvinit-core installed and running (and systemd installed but not running). See last comment. PR on InfluxDB: influxdata/influxdb#7933 |
scripts/post-install.sh
Outdated
which systemctl &>/dev/null | ||
if [[ $? -eq 0 ]]; then | ||
install_systemd | ||
if pidof systemd &>/dev/null; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably check for pid 1 instead. systemd could be running on the system, but not be the init manager. Such as if it's running within a container.
A safe test would be if [[ "$(readlink /proc/1/exe)" == */systemd ]]; then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried that. Indeed it's better and i didn't knew before that systemd can also run but not be the init system. will change the code accordingly ;)
scripts/post-remove.sh
Outdated
fi | ||
elif [ "$1" == "remove" -o "$1" == "purge" ]; then | ||
elif [[ -f /etc/lsb-release ]]; then | ||
# Debian/Ubuntu logic |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come in post-install.sh
, the debian/ubuntu logic is controlled by a test for /etc/debian_version
, but here it's a test for /etc/lsb-release
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good Point. I've mostly copied code from influxdb/kapacitor scripts yesterday. I've fixed this now in the next rebased push and will also fix all that stuff in the other PR's of InfluxDB/Kapacitor
scripts/post-install.sh
Outdated
grep "^telegraf:" /etc/group &>/dev/null | ||
if [[ $? -ne 0 ]]; then | ||
if ! id telegraf &>/dev/null; then | ||
if grep "^telegraf:" /etc/group &>/dev/null; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This reverses the logic from the previous code. Previous code evaluated as true if the match WASN'T found. This code evaluates as true if the match WAS found.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh no....the code is right!! See, before: you execute "id telegraf" and if the user exists already, the return code is 0. The if statement will ONLY be executed if the exit code is NOT 0 eg. the user DOES NOT exist. For my code part it reads like this. If there is a user, negate the return code to non-zero and therefore the if statement is NOT executed because it only has to be executed if there is NO user named telegraf. So "id telegraf" returns NON-0 if there is no such user, the !
negates it to TRUE and the if block is executed, because there is no telegraf user. It's a bit weird but the code is correct ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@phemmer was probably talking about the grep telegraf from /etc/group.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohhh sure....Yeah there's a !
missing in front of the grep like one line above. will fix this immediately!
Does anyone have the resources to test this? I'm a bit hesitant to merge this because init/install changes always seem to cause problems. I haven't heard of any users having problems with the current scripts until this PR, which makes me think this change is only for supporting an uncommon setup. @martinseener why don't you just use systemd? |
@sparrc Can you tell me how i can build the deb out of this PR so i'll happily test this with a final .deb package. But i've already tested this by modifying the scripts from the recent package and it works. Well "uncommon"....maybe. Systemd is indeed the "default" in Debian Jessie and onwards BUT it's not that uncommon to continue using SysVinit with Jessie like an official debian wiki document points out (https://wiki.debian.org/systemd#Installing_without_systemd) There you can see how to use SysVinit instead of Systemd right from the installation. We use this throughout our Jessie Systems as we have several problems with Systemd. (One for ex. booting with PXE is not working here and we have tools that doesnt work with Systemd yet - eg. legacy tools used in companys) These are valid points why people still need to use SysVinit. The other point is that the scripts used at the moment are easily not well written - it's basically "bad" code as the checks used for "systemd"-usage with And people do have problems like you can see here on InfluxDB (Problem is old kernel using sysvinit on a recent ubuntu): influxdata/influxdb#7899 And the last point: This code actually doesn't break anything (i'll happily test this if you tell me how to build the deb from this PR) but it makes the code less error-prone on non-uniform systems. It basically adds support for non-uniform systems ;) |
could you show me other systems that use the same checks in their install scripts? I'm assuming we got
famous last words ;-) |
@sparrc Ok, for example. This is actually a Debian Jessie System with SysVinit installed like described in the debian.org wiki article (it's actually one of our staging systems in our company). When trying to remove telegraf from the latest official release it looks like this (because it can find systemctl but systemd is not running, it fails to remove!):
The actual script which is run here can be found in
You see, my script makes it possible to successfully remove telegraf on non-uniform debians again. If you tell me how to build the deb with my scripts ill also test the installation and upgrade procedure for you but iam pretty sure it works too! |
@martinseener I meant more that I'd like someone who didn't write the PR to test it. I'm sure that it works for you, but we get bit with "works on my machine" bugs all the time, especially related to install/setup. |
@sparrc So please tell me how i can build a new DEB Package out of this PR and i'll test it on a proper Jessie w. Systemd, without Systemd, Ubuntu with and without for you as well. I don't know who could possibly test my PR besides you and me. What do we do if we don't get a volunteer on this? BTW: The Circle-CI volunteer says, it works but i don't know if it tests those scripts ;) I'm trying to modify the latest deb by myself now - should be easier than building a new one. Then ill test this deb on some more machines and report back. |
@sparrc So i've built a debian package out of my branch with the changed scripts and called it telegraf-1.3.0 for this moment. I've attached it here, so everyone can test it on their own. Then i have tested it on Debian Jessie 8.8 with SysVinit without any errors. Install/Remove and Remove with Purge ran just fine. I've also tested Debian Jessie with systemd and Ubuntu 16.04.1 LTS with Systemd. I hope these are just enough tested systems. Tell me if you need more systems - and maybe which one. Below are the results from those 3 systems. Hope those can convince you that my changes are good and remove hassles from installing telegraf of non-conform systems like ours ;-) Results from our staging machine (Jessie/SysVinit) first (check OS and init system + install telegraf + remove telegraf + install again + remove with purge + try start + check processes):
That's it for our Jessie with SysVinit. Now Ubuntu 16.04.1 LTS with systemd:
Also Debian Jessie WITH Systemd (vanilla 8.7 netboot install):
|
When having systemd installed but not actively used (so systemd is not PID 1 eg. not running) we have to assume that we run sysv instead. So i’ve changed the check wether systemd is installed (`which systemctl` ex. on line 63 post-install.sh) to wether systemd is actually running (`pidof systemd`). If it’s not running we can safely fallback to the sysv method for install/remove because `pidof systemd` will exit 1 of systemd is not running. This also fixes problems when running Debian Jessie with package `sysvinit-core` installed and used instead of systemd! I’ve also optimzed the bash scripts a bit, so for example the return codes are not checked after a command has been run (line 28 post-install.sh) but the actual return code is actively checked (`if <command>; then` instead of `<command>; if [ $? -eq 0 ]; then` - for rationale see: https://github.com/koalaman/shellcheck/wiki/SC2181) Also changed the remove/purge OR (line 30 post-remove.sh) see: https://github.com/koalaman/shellcheck/wiki/SC2166 and made some minor indentation fixes
There was actually indeed another bug which a colleague with CentOS stumbled upon while installing my branch as an rpm. the |
@sparrc See last comment please. And here is also the test from CentOS 7 with systemd and the rpm build from this branch (rpm attached). This also works as expected ;) I really hope this is enough evidence that 1) my code works on several different systems 2) fixes the systemd/sysvinit handling as explained in my first commit 👍
telegraf-1.2.1.1~694955c.x86_64.rpm.zip And here is someone else who encountered this issue (with influxdb): influxdata/influxdb#7899 |
For more rationale, please see similar PR on telegraf: influxdata/telegraf#2360
According to last review changes in similar PR on influxdb: influxdata/influxdb#7933
install_init | ||
# Run update-rc.d or fallback to chkconfig if not available | ||
if which update-rc.d &>/dev/null; then | ||
install_update_rcd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we try this? I expect redhat would never have these scripts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we don't know on which Debian-Derivate the .deb package is being executed on later and so you don't know if it uses update-rc.d or chkconfig in advance. this is why we need both (so this .deb package can be run on as most distros as possible!)
if which update-rc.d &>/dev/null; then | ||
install_update_rcd | ||
else | ||
install_chkconfig |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would we try this on debian? It looks like debian does have chkconfig, but the last version it was packaged for was jessie.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above. You shouldn't depend on users only using Debian or Ubuntu. There are a ton of derivates which can use either update-rc.d or chkconfig, so this script should be as flexible as possible. what's you real problem here? i made it really flexible and tested it on several distributions (debian and non-debian) that it works. and it does perfectly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, so if we want to add support for chkconfig here we should also support it in the post-remove script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're totally right, it's missing there. I'll rework the script and add support for it. Should be an easy quick fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danielnelson chkconfig support for debian based systems in post-remove.sh has now been added
scripts/post-remove.sh
Outdated
# Debian/Ubuntu logic | ||
# Remove/purge | ||
rm -f /etc/default/telegraf | ||
if [[ "$1" != "upgrade" ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep the same logic as before (remove or purge), because there are other ways this script can be called such as failed-upgrade
, or abort-install
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it back to the old method as i totally agree with you!
because there can be other options like abort-install
Also added distribution-specific install/remove logic with code copied from telegraf/influxdb scripts. For more rationale, see: influxdata/telegraf#2360
Also added distribution-specific install/remove logic with code copied from telegraf/influxdb scripts. For more rationale, see: influxdata/telegraf#2360
When having systemd installed but not actively used (so systemd is not PID 1 eg. not running) we have to assume that we run sysv instead. So i’ve changed the check wether systemd is installed (
which systemctl
ex. on line 63 post-install.sh) to wether systemd is actually running (pidof systemd
). If it’s not running we can safely fallback to the sysv method for install/remove becausepidof systemd
will exit 1 of systemd is not running. This also fixes problems when running Debian Jessie with packagesysvinit-core
installed and used instead of systemd!I’ve also optimzed the bash scripts a bit, so for example the return codes are not checked after a command has been run (line 28 post-install.sh) but the actual return code is actively checked (
if <command>; then
instead of<command>; if [ $? -eq 0 ]; then
- for rationale see: https://github.com/koalaman/shellcheck/wiki/SC2181)Also changed the remove/purge OR (line 30 post-remove.sh) see: https://github.com/koalaman/shellcheck/wiki/SC2166 and made some minor indentation fixes
Required for all PRs: